[AMDGPU] Try to reuse in v_cndmask register with constant from compare. #131146
Conversation
For some targets, the optimization `X == Const ? X : Y` → `X == Const ? Const : Y` can cause extra register usage for the constant in `v_cndmask`. This patch detects such cases and reuses the register from the compare instruction that already holds the constant, instead of materializing it again. For SWDEV-506659.
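For illustration, here is a minimal IR sketch of the pattern the fold targets (the function name is hypothetical; the constant and shape mirror the f32_oeq_v_i case in the new test file below). When the compare is true, %arg already holds the constant, so the v_cndmask can reuse %arg's register instead of materializing the constant a second time:

```llvm
; Hedged sketch of the pattern this fold targets; mirrors f32_oeq_v_i below.
define float @cndmask_reuse_sketch(float %arg, float %arg1) {
bb:
  %fcmp = fcmp oeq float %arg, 0x3FCF5C2900000000
  %select = select i1 %fcmp, float 0x3FCF5C2900000000, float %arg1
  ret float %select
}
```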
@llvm/pr-subscribers-backend-amdgpu
Author: Daniil Fukalov (dfukalov)
Full diff: https://github.com/llvm/llvm-project/pull/131146.diff (2 files affected)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 91df516b80857..f2857cd381c7e 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1411,15 +1411,80 @@ bool SIFoldOperandsImpl::tryFoldCndMask(MachineInstr &MI) const {
Opc != AMDGPU::V_CNDMASK_B64_PSEUDO)
return false;
+ // Try to find optimized Y == Const ? Const : Z. If Const can't be directly
+ // encoded in the cndmask, try to reuse a register already holding the Const
+ // value from the comparison instruction.
+ auto tryFoldCndMaskCmp =
+ [&](MachineOperand *SrcOp, std::optional<int64_t> SrcImm,
+ unsigned CmpOpcodes[4], AMDGPU::OpName CmpValName) -> bool {
+ // We'll try to process only register operands with known values.
+ if (!SrcImm || !SrcOp->isReg())
+ return false;
+
+ // Find the predicate of the cndmask instruction.
+ MachineOperand *PredOp = TII->getNamedOperand(MI, AMDGPU::OpName::src2);
+ if (!PredOp || !PredOp->isReg())
+ return false;
+
+ MachineInstr *PredI = MRI->getVRegDef(PredOp->getReg());
+ if (!PredI || !PredI->isCompare())
+ return false;
+
+ unsigned CmpOpc = PredI->getOpcode();
+
+ if (CmpOpc != CmpOpcodes[0] && CmpOpc != CmpOpcodes[1] &&
+ CmpOpc != CmpOpcodes[2] && CmpOpc != CmpOpcodes[3])
+ return false;
+
+ // Check if the immediate value of the source operand matches the immediate
+ // value of either the first or second operand of the comparison
+ // instruction.
+ MachineOperand *SubstOp = nullptr;
+ std::optional<int64_t> CmpValImm = getImmOrMaterializedImm(
+ *TII->getNamedOperand(*PredI, AMDGPU::OpName::src0));
+ if (CmpValImm && *CmpValImm == *SrcImm) {
+ SubstOp = TII->getNamedOperand(*PredI, AMDGPU::OpName::src1);
+ } else {
+ CmpValImm = getImmOrMaterializedImm(
+ *TII->getNamedOperand(*PredI, AMDGPU::OpName::src1));
+ if (CmpValImm && *CmpValImm == *SrcImm) {
+ SubstOp = TII->getNamedOperand(*PredI, AMDGPU::OpName::src0);
+ } else {
+ return false;
+ }
+ }
+
+ if (!SubstOp || !SubstOp->isReg())
+ return false;
+
+ LLVM_DEBUG(dbgs() << "Folded " << MI << " into ");
+ SrcOp->setReg(SubstOp->getReg());
+ LLVM_DEBUG(dbgs() << MI);
+ return true;
+ };
+
MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
if (!Src1->isIdenticalTo(*Src0)) {
- std::optional<int64_t> Src1Imm = getImmOrMaterializedImm(*Src1);
- if (!Src1Imm)
- return false;
+ // Try to fold with not-equal comparisons
+ unsigned NECmpOpcodes[4] = {
+ AMDGPU::V_CMP_NEQ_F32_e64, AMDGPU::V_CMP_LG_F32_e64,
+ AMDGPU::V_CMP_NE_I32_e64, AMDGPU::V_CMP_NE_U32_e64};
std::optional<int64_t> Src0Imm = getImmOrMaterializedImm(*Src0);
- if (!Src0Imm || *Src0Imm != *Src1Imm)
+ if (tryFoldCndMaskCmp(Src0, Src0Imm, NECmpOpcodes, AMDGPU::OpName::src1))
+ return true;
+
+ // Try to fold with equal comparisons
+ unsigned EQCmpOpcodes[4] = {
+ AMDGPU::V_CMP_EQ_F32_e64, AMDGPU::V_CMP_EQ_F64_e64,
+ AMDGPU::V_CMP_EQ_I32_e64, AMDGPU::V_CMP_EQ_U32_e64};
+
+ std::optional<int64_t> Src1Imm = getImmOrMaterializedImm(*Src1);
+ if (tryFoldCndMaskCmp(Src1, Src1Imm, EQCmpOpcodes, AMDGPU::OpName::src0))
+ return true;
+
+ if (!Src0Imm || !Src1Imm || *Src0Imm != *Src1Imm)
return false;
}
diff --git a/llvm/test/CodeGen/AMDGPU/fold-cndmask-select.ll b/llvm/test/CodeGen/AMDGPU/fold-cndmask-select.ll
new file mode 100644
index 0000000000000..0b7106448bb7c
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/fold-cndmask-select.ll
@@ -0,0 +1,171 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefix=GFX9
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s -check-prefix=GFX10
+
+define float @f32_oeq_v_i(float %arg, float %arg1) {
+; GFX9-LABEL: f32_oeq_v_i:
+; GFX9: ; %bb.0: ; %bb
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: s_mov_b32 s4, 0x3e7ae148
+; GFX9-NEXT: v_cmp_neq_f32_e32 vcc, s4, v0
+; GFX9-NEXT: v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: f32_oeq_v_i:
+; GFX10: ; %bb.0: ; %bb
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_cmp_neq_f32_e32 vcc_lo, 0x3e7ae148, v0
+; GFX10-NEXT: v_cndmask_b32_e32 v0, 0x3e7ae148, v1, vcc_lo
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+bb:
+ %fcmp = fcmp oeq float %arg, 0x3FCF5C2900000000
+ %select = select i1 %fcmp, float 0x3FCF5C2900000000, float %arg1
+ ret float %select
+}
+
+define float @f32_oeq_i_v(float %arg, float %arg1) {
+; GFX9-LABEL: f32_oeq_i_v:
+; GFX9: ; %bb.0: ; %bb
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: s_mov_b32 s4, 0x3e7ae148
+; GFX9-NEXT: v_cmp_neq_f32_e32 vcc, s4, v0
+; GFX9-NEXT: v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: f32_oeq_i_v:
+; GFX10: ; %bb.0: ; %bb
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_cmp_neq_f32_e32 vcc_lo, 0x3e7ae148, v0
+; GFX10-NEXT: v_cndmask_b32_e32 v0, 0x3e7ae148, v1, vcc_lo
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+bb:
+ %fcmp = fcmp oeq float 0x3FCF5C2900000000, %arg
+ %select = select i1 %fcmp, float 0x3FCF5C2900000000, float %arg1
+ ret float %select
+}
+
+define float @f32_one_v_i(float %arg, float %arg1) {
+; GFX9-LABEL: f32_one_v_i:
+; GFX9: ; %bb.0: ; %bb
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: s_mov_b32 s4, 0x3e7ae148
+; GFX9-NEXT: v_cmp_lg_f32_e32 vcc, s4, v0
+; GFX9-NEXT: v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: f32_one_v_i:
+; GFX10: ; %bb.0: ; %bb
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_cmp_lg_f32_e32 vcc_lo, 0x3e7ae148, v0
+; GFX10-NEXT: v_cndmask_b32_e32 v0, 0x3e7ae148, v1, vcc_lo
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+bb:
+ %fcmp = fcmp one float %arg, 0x3FCF5C2900000000
+ %select = select i1 %fcmp, float %arg1, float 0x3FCF5C2900000000
+ ret float %select
+}
+
+define float @f32_one_i_v(float %arg, float %arg1) {
+; GFX9-LABEL: f32_one_i_v:
+; GFX9: ; %bb.0: ; %bb
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: s_mov_b32 s4, 0x3e7ae148
+; GFX9-NEXT: v_cmp_lg_f32_e32 vcc, s4, v0
+; GFX9-NEXT: v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: f32_one_i_v:
+; GFX10: ; %bb.0: ; %bb
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_cmp_lg_f32_e32 vcc_lo, 0x3e7ae148, v0
+; GFX10-NEXT: v_cndmask_b32_e32 v0, 0x3e7ae148, v1, vcc_lo
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+bb:
+ %fcmp = fcmp one float %arg, 0x3FCF5C2900000000
+ %select = select i1 %fcmp, float %arg1, float 0x3FCF5C2900000000
+ ret float %select
+}
+
+define i32 @i32_eq_v_i(i32 %arg, i32 %arg1) {
+; GFX9-LABEL: i32_eq_v_i:
+; GFX9: ; %bb.0: ; %bb
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: s_mov_b32 s4, 0x67932
+; GFX9-NEXT: v_cmp_ne_u32_e32 vcc, s4, v0
+; GFX9-NEXT: v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: i32_eq_v_i:
+; GFX10: ; %bb.0: ; %bb
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0x67932, v0
+; GFX10-NEXT: v_cndmask_b32_e32 v0, 0x67932, v1, vcc_lo
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+bb:
+ %icmp = icmp eq i32 %arg, 424242
+ %select = select i1 %icmp, i32 424242, i32 %arg1
+ ret i32 %select
+}
+
+define i32 @i32_eq_i_v(i32 %arg, i32 %arg1) {
+; GFX9-LABEL: i32_eq_i_v:
+; GFX9: ; %bb.0: ; %bb
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: s_mov_b32 s4, 0x67932
+; GFX9-NEXT: v_cmp_ne_u32_e32 vcc, s4, v0
+; GFX9-NEXT: v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: i32_eq_i_v:
+; GFX10: ; %bb.0: ; %bb
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0x67932, v0
+; GFX10-NEXT: v_cndmask_b32_e32 v0, 0x67932, v1, vcc_lo
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+bb:
+ %icmp = icmp eq i32 424242, %arg
+ %select = select i1 %icmp, i32 424242, i32 %arg1
+ ret i32 %select
+}
+
+define i32 @i32_ne_v_i(i32 %arg, i32 %arg1) {
+; GFX9-LABEL: i32_ne_v_i:
+; GFX9: ; %bb.0: ; %bb
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: s_mov_b32 s4, 0x67932
+; GFX9-NEXT: v_cmp_ne_u32_e32 vcc, s4, v0
+; GFX9-NEXT: v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: i32_ne_v_i:
+; GFX10: ; %bb.0: ; %bb
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0x67932, v0
+; GFX10-NEXT: v_cndmask_b32_e32 v0, 0x67932, v1, vcc_lo
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+bb:
+ %icmp = icmp ne i32 %arg, 424242
+ %select = select i1 %icmp, i32 %arg1, i32 424242
+ ret i32 %select
+}
+
+define i32 @i32_ne_i_v(i32 %arg, i32 %arg1) {
+; GFX9-LABEL: i32_ne_i_v:
+; GFX9: ; %bb.0: ; %bb
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: s_mov_b32 s4, 0x67932
+; GFX9-NEXT: v_cmp_ne_u32_e32 vcc, s4, v0
+; GFX9-NEXT: v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: i32_ne_i_v:
+; GFX10: ; %bb.0: ; %bb
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0x67932, v0
+; GFX10-NEXT: v_cndmask_b32_e32 v0, 0x67932, v1, vcc_lo
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+bb:
+ %icmp = icmp ne i32 424242, %arg
+ %select = select i1 %icmp, i32 %arg1, i32 424242
+ ret i32 %select
+}
Can we do this earlier, before selection? The same problem exists for the scalar case
  %icmp = icmp ne i32 424242, %arg
  %select = select i1 %icmp, i32 %arg1, i32 424242
  ret i32 %select
}
Test half, i16, and bfloat cases. Plus 64-bit
> Test half, i16, and bfloat cases. Plus 64-bit
I added half and i16 types. For the bfloat case we'll get no advantage here, since the immediate is stored in two registers in any case: one (shifted left) for the compare and a second (the original) for the cndmask, like:
v_lshlrev_b32_e32 v2, 16, v0
s_mov_b32 s4, 0x42420000
v_cmp_eq_f32_e32 vcc, s4, v2
v_cndmask_b32_e32 v0, v1, v0, vcc
For 64-bit types this folding doesn't work yet, since they are lowered into a different cmp/cndmask pattern with pairs of registers (via REG_SEQUENCEs). I would implement it incrementally in an additional patch.
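For reference, a hedged sketch of what a 64-bit variant would look like (hypothetical, not part of this patch's tests). As noted above, it is currently lowered to an f64 compare plus per-half cndmasks whose operands are assembled with REG_SEQUENCE, so the present fold does not apply:

```llvm
; Hypothetical f64 case, not covered by this patch: the 64-bit constant
; lives in a register pair and the select is split into two 32-bit
; cndmasks recombined via REG_SEQUENCE.
define double @f64_oeq_v_i_sketch(double %arg, double %arg1) {
bb:
  %fcmp = fcmp oeq double %arg, 0x3FCF5C2900000000
  %select = select i1 %fcmp, double 0x3FCF5C2900000000, double %arg1
  ret double %select
}
```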
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefix=GFX9
; RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s -check-prefix=GFX10
What about with inline immediate? It shouldn't be any better?
For those cases the immediate is already inlined into the v_cndmask where possible. E.g. in the tests it is successfully inlined for gfx1030 before the patch processes the v_cndmask; in such cases the fold just skips the instruction.
I tried to detect these cases and revert the optimization in amdgpu-codegenprepare, but realized that in some cases (e.g. when the immediate can be encoded into the instruction, depending on the target and the constant) it can increase register usage (as well as instruction count).
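For example, a hedged sketch of the inline-immediate case being discussed (hypothetical function, not from the test file): 2.0 is an AMDGPU inline constant, so it can typically be encoded directly in both the compare and the cndmask, there is no register holding it to reuse, and the fold simply skips the instruction:

```llvm
; Hypothetical inline-immediate case: 2.0 is an inline constant, so no
; SGPR/VGPR is needed to hold it and this fold has nothing to do.
define float @f32_oeq_inline_imm_sketch(float %arg, float %arg1) {
bb:
  %fcmp = fcmp oeq float %arg, 2.0
  %select = select i1 %fcmp, float 2.0, float %arg1
  ret float %select
}
```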
Co-authored-by: Matt Arsenault <[email protected]>
Is this going to match
That would be too late; plenty of compare and select patterns appear during legalization.
I double-checked the case
I guess the exact pattern with the same immediate in two registers in cmp/select is not very common. Moreover, if we get it at the folding stage, it is also profitable to spare a register.
OK, so InstCombine is not broken, but this patch will try to do the equivalent transformation, so this patch is broken.
This patch deals with the result of InstCombine, which started to optimize more cases, and tries to revert this transform at the machine IR stage (in terms of the ternary operation). I've just added tests to check this case.
This pass should handle all MIR correctly, not just the MIR that we currently see as a result of how InstCombine behaves today.
Co-authored-by: Jay Foad <[email protected]>
Actually the pass doesn't rely only on the result of InstCombine; it checks the pattern for a v_cndmask whose predicate is defined by a compare. For both non-mask operands of the instruction it checks whether the operand is a register with a known immediate value, whether that immediate matches one of the compare's operands, and, if so, reuses the register from the other side of the compare.
Or, did you mean the test should be reworked into MIR tests?
Here is a test case that your patch will break because it does not handle +0/-0 correctly: if %arg is +0.0, the compare against -0.0 still succeeds, so the folded code would return +0.0 where the original select returns -0.0.
define float @f32_oeq_v_negzero(float %arg, float %arg1) {
bb:
%fcmp = fcmp oeq float %arg, 0x8000000000000000
%select = select i1 %fcmp, float 0x8000000000000000, float %arg1
ret float %select
}
Yes, you're right, thanks!
Ping...
Ping...
@jayfoad would you please take a look at the patch?
…dmask (#148740)
For some targets, the optimization X == Const ? X : Y -> X == Const ? Const : Y can cause extra register usage or redundant immediate encoding for the constant in the cndmask generated from the ternary operation. This patch detects such cases and reuses the register from the compare instruction that already holds the constant, instead of materializing it again for the cndmask. The optimization avoids immediates that can be encoded into the cndmask instruction (including +-0.0), as well as !isNormal() constants. The change is reworked on the basis of #131146.
---------
Co-authored-by: Copilot <[email protected]>